Supplementary Material: FoveaNet: Perspective-aware Urban Scene Parsing
Abstract
We use a fully convolutional network (FCN) [4] as the baseline model for parsing scene images. Following Chen et al. [1], we initialize the FCN with the vanilla ResNet-101 [3]. Preserving a high spatial resolution in the feature maps is important for accurately segmenting small objects, so we disable the last down-sampling layer by setting its stride to 1. This increases the size of the feature maps output by res5_3b3 to 1/16 of the input image size (without this modification, the output feature maps are only 1/32 of the input image size). The dilation factor of the convolution kernels in the following residual blocks (res5a to res5c) is set to 2, which effectively enlarges the field-of-view (FoV) of the filters therein. To better distinguish neighboring pixels for semantic parsing, we also remove the top pooling layer of ResNet-101, since the pooling operation would unfavorably "smooth" the features of neighboring pixels. On top of the FCN we add a convolutional score layer that outputs a dense pixel-level category prediction for the input image. The score layer has a kernel size of 5 and an output stride of 16 pixels; such a configuration can blur details in the up-sampled prediction. To further improve prediction quality, we follow Long et al. [4] and add skip connections between the score layer and three lower layers: res3b3, pool1, and conv1. A 1 × 1 convolution layer is added on top of each of these layers to produce three additional predictions, which are then fused with the 2×, 4×, and 8× up-sampled score-layer output, respectively, to give the final parsing prediction. The overall structure of our baseline FCN model is illustrated in Figure 1.
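As a rough illustration of the resolution bookkeeping described above, the following sketch (plain Python with hypothetical helper names, not the actual FoveaNet code) computes the backbone's cumulative output stride before and after the modification, the effective extent of the dilated res5 kernels, and the alignment of the skip connections:

```python
def output_stride(strides):
    """Cumulative down-sampling factor of a stack of strided layers."""
    total = 1
    for s in strides:
        total *= s
    return total


def effective_kernel_extent(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


# Vanilla ResNet-101: conv1 (stride 2), pool1 (stride 2), and one
# stride-2 down-sampling at the entry of res3, res4, and res5.
vanilla = [2, 2, 2, 2, 2]

# Baseline here: the last down-sampling stride is set to 1,
# so res5 features sit at 1/16 of the input resolution.
modified = [2, 2, 2, 2, 1]

print(output_stride(vanilla))   # 32 -> feature maps at 1/32 of input
print(output_stride(modified))  # 16 -> feature maps at 1/16 of input

# Dilation 2 in the res5 blocks enlarges each filter's field-of-view:
# a 3x3 kernel with dilation 2 spans 5 positions, so at stride 1/16 it
# covers the same input extent as a plain 3x3 kernel did at stride 1/32.
print(effective_kernel_extent(3, 1))  # 3
print(effective_kernel_extent(3, 2))  # 5

# Skip fusion scales: the 1/16 score map, up-sampled 2x, 4x, and 8x,
# lands at strides 8, 4, and 2, matching res3b3 (1/8), pool1 (1/4),
# and conv1 (1/2) respectively.
for up_factor, skip_stride in [(2, 8), (4, 4), (8, 2)]:
    assert output_stride(modified) // up_factor == skip_stride
```

The assertions in the last loop make the fusion rule explicit: each up-sampling factor is chosen exactly so that the score map's stride matches the stride of the skip branch it is fused with.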
Publication date: 2017